library(tidyverse)  # read_csv(), the %>% pipe, dplyr verbs, and ggplot2 are used below
X2013_08_01T20_31_13_000Z <- read_csv("~/Desktop/USC/Master/Fall Semester/Introduction to Health Data Science/Midterm Project/Possible Data/2013-08-01T20_31_13.000Z.csv")
## Rows: 876 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): State, Location 1
## dbl (5): Year, Smoke everyday, Smoke some days, Former smoker, Never smoked
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smoke<- X2013_08_01T20_31_13_000Z
  1. This dataset is titled “Never smoked trend for 1995-2010”. I found it on the Centers for Disease Control and Prevention (CDC) website. I want to compare how people are distributed across the smoking statuses in different states, and to find out whether smoking prevalence changes with the passing of years.

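First, the data has 7 variables: Year, State, Smoke everyday, Smoke some days, Former smoker, Never smoked, and Location 1. There are 876 observations and 56 distinct values of State: the 50 states, the District of Columbia, three territories (Guam, Puerto Rico, and the Virgin Islands), and two nationwide aggregate rows. The four smoking columns appear to be percentages rather than counts, since their means in the summary below sum to about 100.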

Most of these entries have 16 observations, one for each year from 1995 to 2010. The exceptions are the District of Columbia, Hawaii, and Puerto Rico (15 each), Utah (14), the Virgin Islands (10), and Guam (7). Each year has between 51 and 56 entries, with 51 in 1995 and 54 or more from 1997 onward. Since only a small amount of data is missing, it should not materially affect the study.
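
The per-state counts can also be checked programmatically instead of by reading the full table. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative, not the CDC data). It assumes the same `State` and `Year` column names and flags any entry with fewer yearly observations than expected.

```r
# Toy stand-in for the smoking data (hypothetical values, not the CDC file)
toy <- data.frame(
  State = c(rep("Alabama", 3), rep("Guam", 1), rep("Utah", 2)),
  Year  = c(1995, 1996, 1997, 1997, 1995, 1997),
  stringsAsFactors = FALSE
)

# Number of distinct years covered by the data (3 in this toy example)
expected_years <- length(unique(toy$Year))

# Count observations per state and keep those with an incomplete series
obs_per_state <- table(toy$State)
incomplete <- obs_per_state[obs_per_state < expected_years]
print(incomplete)
```

On the real data, the same idea would compare `table(smoke$State)` against 16.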

Then I used any(is.na()) to check for NA values in the four columns Smoke everyday, Smoke some days, Former smoker, and Never smoked. I found no NA values.

dim(smoke)
## [1] 876   7
table(smoke$Year)
## 
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 
##   51   53   54   54   54   54   56   56   56   54   55   55   56   56   56   56
table(smoke$State)
## 
##                                  Alabama 
##                                       16 
##                                   Alaska 
##                                       16 
##                                  Arizona 
##                                       16 
##                                 Arkansas 
##                                       16 
##                               California 
##                                       16 
##                                 Colorado 
##                                       16 
##                              Connecticut 
##                                       16 
##                                 Delaware 
##                                       16 
##                     District of Columbia 
##                                       15 
##                                  Florida 
##                                       16 
##                                  Georgia 
##                                       16 
##                                     Guam 
##                                        7 
##                                   Hawaii 
##                                       15 
##                                    Idaho 
##                                       16 
##                                 Illinois 
##                                       16 
##                                  Indiana 
##                                       16 
##                                     Iowa 
##                                       16 
##                                   Kansas 
##                                       16 
##                                 Kentucky 
##                                       16 
##                                Louisiana 
##                                       16 
##                                    Maine 
##                                       16 
##                                 Maryland 
##                                       16 
##                            Massachusetts 
##                                       16 
##                                 Michigan 
##                                       16 
##                                Minnesota 
##                                       16 
##                              Mississippi 
##                                       16 
##                                 Missouri 
##                                       16 
##                                  Montana 
##                                       16 
##               Nationwide (States and DC) 
##                                       16 
## Nationwide (States, DC, and Territories) 
##                                       16 
##                                 Nebraska 
##                                       16 
##                                   Nevada 
##                                       16 
##                            New Hampshire 
##                                       16 
##                               New Jersey 
##                                       16 
##                               New Mexico 
##                                       16 
##                                 New York 
##                                       16 
##                           North Carolina 
##                                       16 
##                             North Dakota 
##                                       16 
##                                     Ohio 
##                                       16 
##                                 Oklahoma 
##                                       16 
##                                   Oregon 
##                                       16 
##                             Pennsylvania 
##                                       16 
##                              Puerto Rico 
##                                       15 
##                             Rhode Island 
##                                       16 
##                           South Carolina 
##                                       16 
##                             South Dakota 
##                                       16 
##                                Tennessee 
##                                       16 
##                                    Texas 
##                                       16 
##                                     Utah 
##                                       14 
##                                  Vermont 
##                                       16 
##                           Virgin Islands 
##                                       10 
##                                 Virginia 
##                                       16 
##                               Washington 
##                                       16 
##                            West Virginia 
##                                       16 
##                                Wisconsin 
##                                       16 
##                                  Wyoming 
##                                       16
any(is.na(smoke$`Smoke everyday`))
## [1] FALSE
any(is.na(smoke$`Smoke some days`))
## [1] FALSE
any(is.na(smoke$`Former smoker`))
## [1] FALSE
any(is.na(smoke$`Never smoked`))
## [1] FALSE
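
The four separate checks above could probably be collapsed into one call. This is a minimal sketch using `colSums(is.na())` on a made-up data frame (`toy` is hypothetical and deliberately contains one NA so that the output is not all zeros).

```r
# Toy data frame with one missing value (hypothetical, not the CDC data)
toy <- data.frame(
  `Smoke everyday` = c(16.2, NA, 18.1),
  `Never smoked`   = c(54.0, 55.3, 52.8),
  check.names = FALSE  # keep the column names that contain spaces
)

# Number of NA values in each column, computed in a single call
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Unlike `any()`, this reports how many values are missing in each column, which also makes it easy to spot columns such as Location 1 that are not fully populated.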

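According to the skim() summary, the four variables (Smoke everyday, Smoke some days, Former smoker, Never smoked) look roughly bell-shaped in their histograms, so I treated them as approximately normally distributed. Never smoked is the least symmetric, with a long right tail (maximum 83.7 against a 75th percentile of 56.2).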

library(skimr)
skim(smoke)
Data summary
Name smoke
Number of rows 876
Number of columns 7
_______________________
Column type frequency:
character 2
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
State 0 1.00 4 40 0 56 0
Location 1 37 0.96 11 60 0 54 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2002.59 4.59 1995.0 1999.00 2003.0 2007.00 2010.0 ▇▆▆▆▆
Smoke everyday 0 1 16.56 3.98 3.6 13.90 16.7 19.10 29.1 ▁▃▇▃▁
Smoke some days 0 1 4.84 1.16 1.3 4.20 4.9 5.53 8.5 ▁▂▇▃▁
Former smoker 0 1 24.32 3.50 9.9 22.90 24.5 26.20 33.4 ▁▁▅▇▂
Never smoked 0 1 54.26 5.60 39.5 51.08 53.5 56.20 83.7 ▁▇▂▁▁
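
A small inline histogram is a weak basis for a normality claim. As a hedged sketch, a Shapiro-Wilk test and a Q-Q plot could serve as a more formal check. The vector below is simulated (it is not the CDC data), so its result says nothing about the real columns.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated stand-in for one smoking column (hypothetical values)
x <- rnorm(200, mean = 16.5, sd = 4)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
sw <- shapiro.test(x)
print(sw$p.value)

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(x)
qqline(x)
```

On the real data, the equivalent call would be something like shapiro.test(smoke$`Never smoked`).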
  1. Second, since there are too many states to inspect one by one, I averaged each category by year to look more closely at how the values change over time. These are unweighted averages over all entries in each year, including the two nationwide rows. According to the four graphs, the average percentage of people who smoke every day decreases over the years, the percentage who smoke some days increases gradually, and the percentages of never smokers and of former smokers (people who quit smoking) both increase.
smoke_avg<-
  smoke%>%
  group_by(Year)%>%
  summarize(
    S_everyday_avg=mean(`Smoke everyday`),
    S_someday_avg=mean(`Smoke some days`),
    S_former_avg=mean(`Former smoker`),
    S_never_avg=mean(`Never smoked`)
  )

smoke_avg%>%
  ggplot(mapping = aes(x=Year,y=S_everyday_avg))+
  geom_point()+
  geom_smooth(method=lm,col="black")
## `geom_smooth()` using formula 'y ~ x'

smoke_avg%>%
  ggplot(mapping = aes(x=Year,y=S_someday_avg))+
  geom_point()+
  geom_smooth(method=lm,col="black")
## `geom_smooth()` using formula 'y ~ x'

smoke_avg%>%
  ggplot(mapping = aes(x=Year,y=S_former_avg))+
  geom_point()+
  geom_smooth(method=lm,col="black")
## `geom_smooth()` using formula 'y ~ x'

smoke_avg%>%
  ggplot(mapping = aes(x=Year,y=S_never_avg))+
  geom_point()+
  geom_smooth(method=lm,col="black")
## `geom_smooth()` using formula 'y ~ x'
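
The four plots above differ only in the y variable, so they could probably be drawn as one faceted figure. The following is a minimal sketch that assumes the same column names as `smoke_avg`. The `toy_avg` values are made up for illustration, and `pivot_longer()` comes from tidyr.

```r
library(tidyr)
library(ggplot2)

# Toy yearly averages (hypothetical values, same column names as smoke_avg)
toy_avg <- data.frame(
  Year           = 1995:1998,
  S_everyday_avg = c(19.0, 18.6, 18.1, 17.9),
  S_someday_avg  = c(4.0, 4.1, 4.3, 4.4),
  S_former_avg   = c(23.5, 23.8, 24.0, 24.1),
  S_never_avg    = c(53.5, 53.5, 53.6, 53.6)
)

# Reshape to long format: one row per (Year, status) pair
toy_long <- pivot_longer(
  toy_avg,
  cols = -Year,
  names_to = "status",
  values_to = "avg"
)

# One panel per smoking status, each with its own y-axis range
p <- ggplot(toy_long, aes(x = Year, y = avg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, colour = "black") +
  facet_wrap(~ status, scales = "free_y")
print(p)
```

Free y-axes keep the small "some days" values readable next to the much larger "never smoked" values, at the cost of making the panels harder to compare directly.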

  1. After finding the overall relationship between year and the four smoking statuses, I created a new variable called Geo_cate, which holds the geographical region: Northeast, Southwest, West, Southeast, or Midwest. I assigned the states to these five regions and then looked at the relationship between each of the four smoking statuses and the region. Entries that are not listed in the mapping (Alaska, Maryland, the District of Columbia, the territories, and the two nationwide rows) get NA and are left out of the regional plots.

All five regions show a decreasing trend in the percentage of people who smoke every day, and a slight increase for people who smoke some days. For former smokers, the Northeast, Midwest, Southeast, and Southwest show a slight increase, while the West shows a slight decrease. For people who never smoked, the regions show an increasing trend.

smoke<-
  smoke%>%
  mutate(Geo_cate=case_when(
    State == "Connecticut" ~ "Northeast",
    State == "Maine" ~ "Northeast",
    State == "Massachusetts" ~ "Northeast",
    State == "New Hampshire" ~ "Northeast",
    State == "Rhode Island" ~ "Northeast",
    State == "Vermont" ~ "Northeast",
    State == "New Jersey" ~ "Northeast",
    State == "New York" ~ "Northeast",
    State == "Delaware" ~ "Northeast",
    State == "Pennsylvania" ~ "Northeast",
    State == "Alabama" ~ "Southeast",
    State == "Arkansas" ~ "Southeast",
    State == "Florida" ~ "Southeast",
    State == "Georgia" ~ "Southeast",
    State == "Kentucky" ~ "Southeast",
    State == "Louisiana" ~ "Southeast",
    State == "Mississippi" ~ "Southeast",
    State == "North Carolina" ~ "Southeast",
    State == "South Carolina" ~ "Southeast",
    State == "Tennessee" ~ "Southeast",
    State == "Virginia" ~ "Southeast",
    State == "West Virginia" ~ "Southeast",
    State == "Arizona" ~ "Southwest",
    State == "Colorado" ~ "Southwest",
    State == "Utah" ~ "Southwest",
    State == "Nevada" ~ "Southwest",
    State == "New Mexico" ~ "Southwest",
    State == "Idaho" ~ "West",
    State == "Montana" ~ "West",
    State == "Wyoming" ~ "West",
    State == "California" ~ "West",
    State == "Washington" ~ "West",
    State == "Oregon" ~ "West",
    State == "Hawaii" ~ "West",
    State == "Oklahoma" ~ "Southwest",
    State == "Texas" ~ "Southwest",
    State == "Illinois" ~ "Midwest",
    State == "Indiana" ~ "Midwest",
    State == "Iowa" ~ "Midwest",
    State == "Kansas" ~ "Midwest",
    State == "Michigan" ~ "Midwest",
    State == "Minnesota" ~ "Midwest",
    State == "Missouri" ~ "Midwest",
    State == "Nebraska" ~ "Midwest",
    State == "North Dakota" ~ "Midwest",
    State == "Ohio" ~ "Midwest",
    State == "South Dakota" ~ "Midwest",
    State == "Wisconsin" ~ "Midwest",
  ))
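
A long `case_when()` makes it easy to leave an entry out without noticing. One possible alternative, sketched here under the assumption that a named character vector is an acceptable lookup table, is to index the vector by `State` and then list whatever did not match. The lookup below is a hypothetical excerpt, not the full mapping used above.

```r
# Hypothetical excerpt of a state-to-region lookup (not the full mapping)
region_lookup <- c(
  "Connecticut" = "Northeast",
  "New York"    = "Northeast",
  "Texas"       = "Southwest",
  "Ohio"        = "Midwest"
)

# Toy data with one entry that is missing from the lookup
toy <- data.frame(
  State = c("Ohio", "Texas", "Alaska", "New York"),
  stringsAsFactors = FALSE
)

# Indexing a named vector by State returns NA for unmatched names
toy$Geo_cate <- unname(region_lookup[toy$State])

# List the entries that did not receive a region
unassigned <- unique(toy$State[is.na(toy$Geo_cate)])
print(unassigned)
```

Printing `unassigned` right after the mapping step shows which entries will silently drop out of any later `filter(Geo_cate == ...)` call.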

#Smoke Everyday
smoke%>%
  filter(Geo_cate=='Northeast')%>%
  ggplot(mapping = aes(x=Year,y=`Smoke everyday`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Southwest')%>%
  ggplot(mapping = aes(x=Year,y=`Smoke everyday`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='West')%>%
  ggplot(mapping = aes(x=Year,y=`Smoke everyday`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Southeast')%>%
  ggplot(mapping = aes(x=Year,y=`Smoke everyday`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Midwest')%>%
  ggplot(mapping = aes(x=Year,y=`Smoke everyday`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

# Smoke Somedays
smoke%>%
  filter(Geo_cate=='Northeast')%>%
  ggplot(mapping = aes(x=Year,y=`Smoke some days`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Southwest')%>%
  ggplot(mapping = aes(x=Year,y=`Smoke some days`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='West')%>%
  ggplot(mapping = aes(x=Year,y=`Smoke some days`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Southeast')%>%
  ggplot(mapping = aes(x=Year,y=`Smoke some days`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Midwest')%>%
  ggplot(mapping = aes(x=Year,y=`Smoke some days`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

#Former Smoker
smoke%>%
  filter(Geo_cate=='Northeast')%>%
  ggplot(mapping = aes(x=Year,y=`Former smoker`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Southwest')%>%
  ggplot(mapping = aes(x=Year,y=`Former smoker`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='West')%>%
  ggplot(mapping = aes(x=Year,y=`Former smoker`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Southeast')%>%
  ggplot(mapping = aes(x=Year,y=`Former smoker`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Midwest')%>%
  ggplot(mapping = aes(x=Year,y=`Former smoker`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

#Never Smoke
smoke%>%
  filter(Geo_cate=='Northeast')%>%
  ggplot(mapping = aes(x=Year,y=`Never smoked`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Southwest')%>%
  ggplot(mapping = aes(x=Year,y=`Never smoked`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='West')%>%
  ggplot(mapping = aes(x=Year,y=`Never smoked`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Southeast')%>%
  ggplot(mapping = aes(x=Year,y=`Never smoked`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'

smoke%>%
  filter(Geo_cate=='Midwest')%>%
  ggplot(mapping = aes(x=Year,y=`Never smoked`))+
  geom_point()+
  geom_smooth(method = lm,col='black')
## `geom_smooth()` using formula 'y ~ x'
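
Reading the direction of a trend from twenty separate plots is error-prone, so it may help to put a number on each slope. The following is a minimal sketch that fits one linear model per region and extracts the coefficient on `Year`. The data frame `toy` and its `everyday` column are made up with exact linear trends so that the slopes are known in advance.

```r
# Toy data with exact linear trends (hypothetical values, not the CDC data)
years <- 1995:2000
toy <- data.frame(
  Geo_cate = rep(c("Northeast", "West"), each = length(years)),
  Year     = rep(years, times = 2),
  everyday = c(20 - 0.5 * (years - 1995),   # Northeast: -0.5 per year
               15 - 0.2 * (years - 1995)),  # West: -0.2 per year
  stringsAsFactors = FALSE
)

# Fit one linear model per region and extract the slope on Year
slopes <- sapply(
  split(toy, toy$Geo_cate),
  function(d) coef(lm(everyday ~ Year, data = d))[["Year"]]
)
print(round(slopes, 3))
```

A negative slope corresponds to a decreasing trend. On the real data the response would be one of the four smoking columns, for example `Smoke everyday`.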

  1. After looking at the relationship between the five geographical regions and the four smoking statuses, I wanted a more direct view: the relationship between each state and the four smoking statuses. Thus, I used leaflet to see how the values are spread geographically and where they are concentrated. Each map has its own color scale, so a color shows where a value falls within that variable's range. According to the four leaflet plots, the values for people who smoke every day and for former smokers tend to be around 20-30, since the data points look orange or red. The data points for people who never smoked and who smoke some days look mostly green, which places them toward the lower end of their own ranges (39.5 to 83.7 and 1.3 to 8.5, respectively).
dat <- c("Alabama (32.840569999605975, -86.63186000013877)",
"Alaska (64.84507999974238, -147.72205999986895)",
"Arizona (34.86596999961597, -111.76380999973156)",
"Arkansas (34.748649999697875, -92.27448999971358)",
"California (37.638300000444815, -120.99958999997835)",
"Colorado (38.842890000173554, -106.13314000041055)",
"Connecticut (41.56265999995918, -72.6498400002157)",
"Delaware (39.00883000020451, -75.57774000040052)",
"District of Columbia (38.89036999987576, -77.03195999965413)",
"Florida (28.932039999846268, -81.9289599999039)",
"Georgia (32.83967999993223, -83.62758000031658)",
"Hawaii (21.304850000427336, -157.85774999956269)",
"Idaho (43.682590000228515, -114.36368000023168)",
"Illinois (40.485010000411364, -88.99770999971656)",
"Indiana (39.76690999989677, -86.14996000035359)",
"Iowa (42.469390000048634, -93.81649000001335)",
"Kansas (38.34774000000118, -98.20077999969709)",
"Kentucky (37.645969999815804, -84.77496999996538)",
"Louisiana (31.31265999975932, -92.44567999993188)",
"Maine (45.25423000041434, -68.9850299999344)",
"Maryland (39.29057999976732, -76.6092600004485)",
"Massachusetts (42.27687000005062, -72.08269000004333)",
"Michigan (44.661320000317914, -84.71438999959867)",
"Minnesota (46.3556499998478, -94.79419999982997)",
"Mississippi (32.7455100000866, -89.53803000008429)",
"Missouri (38.63578999960896, -92.5663000000448)",
"Montana (47.06653000015956, -109.42441999998289)",
"Nebraska (41.6410400000961, -99.36572999973953)",
"Nevada (39.49323999972637, -117.07183999971608)",
"New Hampshire (43.65595000019255, -71.50036000041354)",
"New Jersey (40.13056999960594, -74.2736899996936)",
"New Mexico (34.52088000011207, -106.24057999976702)",
"New York (42.82699999955048, -75.54396999981549)",
"North Carolina (35.46624999963797, -79.1593199999179)",
"North Dakota (47.475320000018144, -100.11841999998285)",
"Ohio (40.06020999969189, -82.40426000019869)",
"Oklahoma (35.4720099999617, -97.52034999975251)",
"Oregon (44.567449999917756, -120.15502999983448)",
"Pennsylvania (40.79372999993973, -77.86069999960512)",
"Rhode Island (41.70828000002217, -71.5224700001902)",
"South Carolina (33.99855000018255, -81.0452500001872)",
"South Dakota (44.353130000049646, -100.37353000040906)",
"Tennessee (35.68094000038087, -85.77449000011325)",
"Texas (31.82724000022597, -99.42676999973554)",
"Utah (39.36070000030492, -111.58712999994941)",
"Vermont (43.625379999687425, -72.51764000028561)",
"Virginia (37.54268000028196, -78.45789000012326)",
"Washington (47.522280000022135, -120.47001000026114)",
"West Virginia (38.66550999958696, -80.71263999973604)",
"Wisconsin (44.39319000021851, -89.81636999977553)",
"Wyoming (43.23553999957147, -108.10982999975454)")
dat <- data.frame(state = dat, stringsAsFactors = FALSE)

dat_new <- data.frame(
  state = gsub("\\s*\\(.+", "", dat$state, perl = TRUE),
  lat   = stringr::str_extract(dat$state, "(?<=\\()[0-9.-]+"),
  lon   = stringr::str_extract(dat$state, "[0-9.-]+(?=\\))")
)

dat_new$lon <- as.numeric(dat_new$lon)
dat_new$lat <- as.numeric(dat_new$lat)

str(dat_new)
## 'data.frame':    51 obs. of  3 variables:
##  $ state: chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ lat  : num  32.8 64.8 34.9 34.7 37.6 ...
##  $ lon  : num  -86.6 -147.7 -111.8 -92.3 -121 ...
smoke1<-smoke
smoke1 <- left_join(smoke1,dat_new, 
              by = c("State" = "state"))
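
The coordinate table has 51 entries while the data has 56, so some rows will have no coordinates after the join. A quick check like the following sketch can list them. Both data frames are made-up stand-ins for `smoke1` and `dat_new`.

```r
# Toy versions of the two tables being joined (hypothetical values)
smoke_toy <- data.frame(
  State = c("Alabama", "Guam", "Nationwide (States and DC)", "Alabama"),
  Year  = c(1995, 1995, 1995, 1996),
  stringsAsFactors = FALSE
)
coords_toy <- data.frame(
  state = "Alabama", lat = 32.84, lon = -86.63,
  stringsAsFactors = FALSE
)

# Entries present in the data but absent from the coordinate table
unmatched <- setdiff(unique(smoke_toy$State), coords_toy$state)
print(unmatched)

# Number of rows that would end up with NA coordinates after a left join
n_missing <- sum(smoke_toy$State %in% unmatched)
print(n_missing)
```

In the real data the unmatched entries are Guam (7 rows), Puerto Rico (15), the Virgin Islands (10), and the two nationwide rows (16 each). That is 64 rows in total, which agrees with the warning that leaflet prints for each map.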

A. Smoking everyday

library(leaflet)

commu.pal <- colorNumeric(c('darkgreen','goldenrod','brown'), domain=smoke1$`Smoke everyday`)
leaflet(smoke1)%>%
  addProviderTiles('CartoDB.VoyagerLabelsUnder')%>%
  addCircles(
    lat = ~lat,lng = ~lon,
    label = ~paste0(round(`Smoke everyday`,2)),color = ~commu.pal(`Smoke everyday`),
    opacity = 1, fillOpacity = 1, radius = 500
  )%>%
  addLegend('bottomleft',pal = commu.pal,values = smoke1$`Smoke everyday`,title = 'The number of people smoking everyday',opacity = 1)
## Warning in validateCoords(lng, lat, funcName): Data contains 64 rows with either
## missing or invalid lat/lon values and will be ignored

B. Smoke Some Days

library(leaflet)

commu.pal <- colorNumeric(c('darkgreen','goldenrod','brown'), domain=smoke1$`Smoke some days`)
leaflet(smoke1)%>%
  addProviderTiles('CartoDB.VoyagerLabelsUnder')%>%
  addCircles(
    lat = ~lat,lng = ~lon,
    label = ~paste0(round(`Smoke some days`,2)),color = ~commu.pal(`Smoke some days`),
    opacity = 1, fillOpacity = 1, radius = 500
  )%>%
  addLegend('bottomleft',pal = commu.pal,values = smoke1$`Smoke some days`,title = 'The number of people smoking some days',opacity = 1)
## Warning in validateCoords(lng, lat, funcName): Data contains 64 rows with either
## missing or invalid lat/lon values and will be ignored

C. Former Smoker

library(leaflet)

commu.pal <- colorNumeric(c('darkgreen','goldenrod','brown'), domain=smoke1$`Former smoker`)
leaflet(smoke1)%>%
  addProviderTiles('CartoDB.VoyagerLabelsUnder')%>%
  addCircles(
    lat = ~lat,lng = ~lon,
    label = ~paste0(round(`Former smoker`,2)),color = ~commu.pal(`Former smoker`),
    opacity = 1, fillOpacity = 1, radius = 500
  )%>%
  addLegend('bottomleft',pal = commu.pal,values = smoke1$`Former smoker`,title = 'The number of people who is former smoker',opacity = 1)
## Warning in validateCoords(lng, lat, funcName): Data contains 64 rows with either
## missing or invalid lat/lon values and will be ignored

D. Never Smoke

library(leaflet)

commu.pal <- colorNumeric(c('darkgreen','goldenrod','brown'), domain=smoke1$`Never smoked`)
leaflet(smoke1)%>%
  addProviderTiles('CartoDB.VoyagerLabelsUnder')%>%
  addCircles(
    lat = ~lat,lng = ~lon,
    label = ~paste0(round(`Never smoked`,2)),color = ~commu.pal(`Never smoked`),
    opacity = 1, fillOpacity = 1, radius = 500
  )%>%
  addLegend('bottomleft',pal = commu.pal,values = smoke1$`Never smoked`,title = 'The number of people who never smoke',opacity = 1)
## Warning in validateCoords(lng, lat, funcName): Data contains 64 rows with either
## missing or invalid lat/lon values and will be ignored
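
The four map chunks differ only in the column and the legend title, so they could be wrapped in a small helper. The sketch below is one possible way to do that. The function `make_status_map()` and the `toy` data are hypothetical. Note also that every year of a state shares the same coordinates, so the circles for different years are drawn on top of each other; mapping a single year at a time may be easier to read.

```r
library(leaflet)

# Hypothetical helper: draw one map for a given numeric column
make_status_map <- function(data, column, title) {
  vals <- data[[column]]
  # Color scale spans the range of the chosen column only
  pal <- colorNumeric(c("darkgreen", "goldenrod", "brown"), domain = vals)
  m <- leaflet(data)
  m <- addProviderTiles(m, "CartoDB.VoyagerLabelsUnder")
  m <- addCircles(
    m,
    lng = data$lon, lat = data$lat,
    label = as.character(round(vals, 2)),
    color = pal(vals),
    opacity = 1, fillOpacity = 1, radius = 500
  )
  addLegend(m, "bottomleft", pal = pal, values = vals,
            title = title, opacity = 1)
}

# Toy data for a single year (hypothetical values, not the CDC data)
toy <- data.frame(
  State = c("Alabama", "Texas"),
  lat   = c(32.84, 31.83),
  lon   = c(-86.63, -99.43),
  `Smoke everyday` = c(18.2, 15.7),
  check.names = FALSE
)

m <- make_status_map(toy, "Smoke everyday", "Smoke everyday (%)")
```

Calling the helper once per column would replace the four repeated chunks and keep the palette, tiles, and legend settings in one place.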

Conclusion: According to the data analysis and graphs above, the results are consistent with smoking control efforts having improved from 1995 to 2010, although this analysis cannot by itself show the cause. The percentage of people who smoke every day decreased, the percentage of former smokers increased, and the percentage of people who never smoked increased. The same trend is also found in the 5 geographical regions (Northeast, Southwest, West, Southeast, and Midwest), except that former smokers decreased slightly in the West.